# Image-text generation
Gemma 3 12b It Quantized.w8a8
An INT8-quantized version of google/gemma-3-12b-it that takes image and text input and produces text output, suitable for efficient inference deployment.
Image-to-Text
Transformers

RedHatAI
237
1
Xlangai Jedi 3B 1080p GGUF
Apache-2.0
Jedi-3B-1080p is a 3B-parameter model developed by xlangai, quantized using llama.cpp, suitable for image-text generation tasks.
Large Language Model English
bartowski
148
1
Medgemma 4b It GGUF
Other
medgemma-4b-it is a multimodal model focused on the medical field, capable of processing image and text inputs, and suitable for multiple medical scenarios such as radiology and clinical reasoning.
Image-to-Text
Transformers

second-state
564
1
Internvl3 8B Hf
Other
InternVL3 is an advanced multimodal large language model series with powerful multimodal perception and reasoning capabilities, supporting image, video, and text inputs.
Image-to-Text
Transformers Other

OpenGVLab
454
1
Internvl3 2B Hf
Other
InternVL3-2B is a multimodal large language model packaged for the Hugging Face Transformers library. It handles image, video, and text inputs, accepts multiple input formats, and supports batched inference.
Image-to-Text
Transformers Other

OpenGVLab
41.22k
2
Kimi VL A3B Thinking 8bit
Other
Kimi-VL-A3B-Thinking-8bit is a multimodal vision-language model converted to the MLX format, supporting image-text-to-text generation tasks.
Image-to-Text
Transformers Other

mlx-community
1,738
1
Gemma 3 27b It Qat Bf16
Gemma 3 27B IT QAT BF16 is a Gemma 3 model from Google that has undergone quantization-aware training (QAT) and been converted to the BF16 format for use with the MLX framework.
Image-to-Text
Transformers

mlx-community
178
2
Gemma 3 12b It Qat Int4 Unquantized
Gemma 3 is a lightweight multimodal open model from Google that takes text and image inputs and produces text output, with a 128K-token context window and multilingual capabilities.
Image-to-Text
Transformers

google
1,358
9
Gemma 3 4b It Int4 Awq
Gemma is a lightweight, advanced open model series from Google, built using the same research technology as Gemini. Gemma 3 is a multimodal model capable of processing both text and image inputs to generate text outputs.
Image-to-Text
Transformers

gaunernst
1,054
1
Smoldocling 256M Preview Mlx Fp16
Apache-2.0
This model is converted from ds4sd/SmolDocling-256M-preview to the MLX format, supporting image-text-to-text tasks.
Image-to-Text
Transformers English

ahishamm
24
1
Bytedance Research.ui TARS 72B SFT GGUF
A 72B-parameter multimodal foundation model released by ByteDance Research, specializing in image-text-to-text tasks
Image-to-Text
DevQuasar
81
1
Aya Vision 8b
Aya Vision 8B is an open-weight 8-billion-parameter multilingual vision-language model supporting visual and language tasks in 23 languages.
Image-to-Text
Transformers Supports Multiple Languages

CohereLabs
29.94k
282
Gemma 3 12b Pt
Gemma is a lightweight open-source multimodal model series launched by Google, built on the same technology as Gemini, supporting text and image inputs and generating text outputs.
Image-to-Text
Transformers

google
54.36k
46
Aria Sequential Mlp FP8 Dynamic
Apache-2.0
An FP8 dynamically quantized model based on Aria-sequential_mlp, suitable for image-text-to-text tasks and requiring approximately 30 GB of VRAM.
Image-to-Text
Transformers

leon-se
94
6
Florence 2 Flux Large
Apache-2.0
A vision-language model based on Microsoft Florence-2-large, excelling in image understanding and text generation tasks
Image-to-Text
Transformers Supports Multiple Languages

gokaygokay
14.96k
45
Idefics 9b
Other
IDEFICS is an open-source multimodal model that takes image and text inputs and generates text outputs. It is an open reproduction of DeepMind's Flamingo model.
Image-to-Text
Transformers English

HuggingFaceM4
3,676
46
Blip2 Image To Text
MIT
BLIP-2 is a vision-language model that bootstraps language-image pre-training from a frozen image encoder and a frozen large language model.
Image-to-Text
Transformers English

paragon-AI
343
27